Add SmolVLM #163
base: main
Conversation
| """ | ||
| import types | ||
|
|
||
| def export_friendly_forward(self, pixel_values: torch.FloatTensor, patch_attention_mask: torch.BoolTensor) -> torch.Tensor: |
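The `import types` in this diff suggests the patched forward is bound onto an existing module instance with `types.MethodType`. A minimal, self-contained sketch of that monkey-patching pattern (the toy module and the trivial replacement body are illustrative, not the SmolVLM code):

```python
import types

import torch


def export_friendly_forward(self, x: torch.Tensor) -> torch.Tensor:
    # Stand-in replacement forward; in the PR this is where the
    # data-dependent logic would be rewritten for export.
    return self.linear(x)


module = torch.nn.Module()
module.linear = torch.nn.Linear(4, 4)
# Bind the replacement onto this one instance without subclassing,
# so other instances of the class keep the original forward.
module.forward = types.MethodType(export_friendly_forward, module)
print(module(torch.randn(2, 4)).shape)  # torch.Size([2, 4])
```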
@zucchini-nlp This part could not be exported because of the data-dependent loop, so I unrolled it here to handle just one image. Here's the original code:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics3/modeling_idefics3.py#L149
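For context, a paraphrased sketch of the kind of data-dependent pattern in the linked upstream code that torch.export cannot trace through (illustrative, not the verbatim upstream code):

```python
import torch

# Illustrative input: a batch of per-patch boolean attention masks.
patch_attention_mask = torch.ones(2, 16, 16, dtype=torch.bool)

# Paraphrased shape of the upstream per-image loop: the number of valid
# patches is read out of the mask at runtime, so the loop body depends on
# tensor values and export fails on it.
for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
    nb_patches_h = p_attn_mask[:, 0].sum()  # data-dependent scalar
    nb_patches_w = p_attn_mask[0].sum()     # data-dependent scalar
    # ...position ids for image `batch_idx` are then built from these counts...

# Unrolled to a single image, as in export_friendly_forward, the Python loop
# over the batch disappears and only one image's mask is handled.
```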
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I thought it was fixed in huggingface/transformers#39614
Ah, that one just exports the vision encoder; this one exports the `get_image_features` function, which calls the vision encoder. I'm thinking some code in this function might be confusing to export. cc @tugsbayasgalan
Ahh, I see; it's probably the part where images are unpadded, since that code is value-dependent.
For my understanding: can we export the vision and LM parts separately but not the merging logic? Most `get_image_features` implementations might not be 100% exportable for VLMs, so keeping them outside of export could work better in the long term.
The reason we export `get_image_features` is that if we only exported the vision encoder, we would need to write the rest of the merging logic in C++, which is difficult to do and hard to scale. I wonder if it's possible to upstream an exportable version of `get_image_features`? If not, I'm happy to just monkey patch like this.
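As a rough illustration of what exporting the whole `get_image_features` path buys (one graph covering encoder plus merging/projection, so none of it has to be reimplemented in C++), here is a self-contained sketch with a toy model; the architecture and names are illustrative assumptions, not the SmolVLM API:

```python
import torch


class ToyVLM(torch.nn.Module):
    # Hypothetical stand-in for a VLM with a get_image_features-style method.
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Conv2d(3, 8, kernel_size=14, stride=14)
        self.proj = torch.nn.Linear(8, 16)

    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(pixel_values)        # (B, 8, H/14, W/14)
        feats = feats.flatten(2).transpose(1, 2)  # (B, num_patches, 8)
        return self.proj(feats)                   # projected/merged features


class ImageFeatures(torch.nn.Module):
    """Wrapper so torch.export traces get_image_features as the forward."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        return self.model.get_image_features(pixel_values)


ep = torch.export.export(ImageFeatures(ToyVLM()), (torch.randn(1, 3, 224, 224),))
print(ep)  # one exported graph covering encoder + merging
```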
> I wonder if it's possible to upstream an exportable version of `get_image_features`

This would be the best solution, and if you want to submit a PR it'll be very welcome. I see that we're passing `pixel_attention_mask=None` in all cases in this PR, but the vision backbone is still not exportable?
```python
h_indices = torch.arange(nb_patches_h, device=pixel_values.device, dtype=torch.long)
w_indices = torch.arange(nb_patches_w, device=pixel_values.device, dtype=torch.long)


# This replaces bucketize(x, boundaries=[1/N, 2/N, ...], right=True) ≈ floor(x * N), which
```
This is the second change: we don't have a kernel for `aten.bucketize`, so I compute it manually.
Requires the following Transformers changes (also WIP): huggingface/transformers#41629
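A quick, self-contained check of the `bucketize` ≈ `floor(x * N)` equivalence being relied on above; `N` and the coordinate grid are illustrative values, not taken from the model config:

```python
import torch

N = 32  # e.g. num_patches_per_side (illustrative)

# Fractional patch coordinates in [0, 1), as the upstream code builds them.
x = torch.arange(0, 1 - 1e-6, 1 / 17)

# Reference: bucketize against boundaries [1/N, 2/N, ..., (N-1)/N].
boundaries = torch.arange(1 / N, 1.0, 1 / N)
ref = torch.bucketize(x, boundaries, right=True)

# Export-friendly replacement: floor(x * N), clamped so values at the
# top edge still map into the last bucket.
alt = (x * N).floor().long().clamp(max=N - 1)

assert torch.equal(ref, alt)
```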